Building intuition into popgen fundamentals and inference
Exercises for the Workshop on population and speciation genomics
Reference materials
As you will be working on the exercises below (all of which focus on the popgen simulation R package slendr), some reference materials will be useful:
- First and foremost, you should refer to the slides with the slendr “crash course” which we started our activity session with, and which cover pretty much every bit of slendr functionality you will need. You can find the material rendered as normal slides or one-page handouts (the latter will be a bit more practical for reference, as you won’t have to search for information by flipping through individual slides). If these Quarto Pub links are not working, it could be because their servers are overloaded. In this case, you can simply open slides.html or handouts.html from the GitHub repository for this activity (you will have them also in the cloud – instructions below).
- Manual pages of all available slendr functions. Note that you can get the help page of every slendr R function() by typing ?function in the R console, and read it directly in RStudio. For instance, typing ?ts_tajima gives you the help page of the tskit/slendr function implementing the tree-based computation of Tajima’s D. Whenever I discuss an optional parameter of some function (or whenever you want to modify a function’s behavior), this is where you can find more information about it.
- Relevant tutorials on the slendr website, which you can find under the “Articles” link in the header of the webpage. In general, the website contains much more detailed information than is available in the handouts in their much more condensed form. I think this is probably too much information for this activity, but don’t hesitate to read a bit more if you’re interested!
Installation setup
If you’re using the RStudio browser-based cloud instance provided by the workshop organizers, you don’t actually have to do anything! Just log in to the RStudio session in your browser and navigate to the directory for this activity (~/workshop_materials/29_slendr) via File -> Open Project... in the RStudio menu (selecting the cesky-krumlov-2025.Rproj project file in that directory will open a fresh R session for that project in your RStudio). Then you’re good to go!
Organization of the exercises
For each exercise, you will get a brief explanation of the problem at hand and some information about functions that could be useful to solve the exercise. The concrete, specific task will always be written like this in bold. As you work on each part of each exercise, look for these bold lines.
Your goal for each exercise is to write a complete R script (in RStudio File -> New file -> R script). I suggest you save the script for each exercise as exercise1.R, exercise2.R, etc., just to keep things tidy and easy to troubleshoot if needed.
Each exercise is composed of individual parts, which are designed to build one upon the other in the order they are specified. As you progress through sequential parts of each exercise, you will be adding code to your script for that exercise.
Don’t worry about the number of exercises you’re “supposed to do”
How far we manage to crunch through the content of this session will depend on everybody’s level of confidence in programming and population genetics. The contents of the activities as a whole are designed so that no matter the moment at which we run out of time, you will take home enough knowledge of slendr to be able to immediately apply it to your own projects.
To avoid anyone getting overwhelmed, I’ll be stopping the activity at some “checkpoints”, to check in with how everything is going and quickly go through the example solutions for the parts until that point, and give a quick explanation.
If you find yourself way ahead, there are bonus exercises (which I will not be going through with everyone) – those should provide entertainment and a bit more of a challenge for you while the rest of us are catching up. If you’re really far ahead, just continue to the next exercise without waiting for the rest of us.
Note on the programming aspect of the exercises
All the exercises will involve “real coding”! If you’ve never really programmed entire scripts before, this could feel a little intimidating. Don’t worry. If you’re ever lost, just take a peek at the solutions which are (by default hidden) under each exercise part. Always try to work on a solution on your own, but never let this be a barrier to your progress. Feel free to copy-paste bits of my solutions into your own scripts.
If you find yourself totally lost, don’t hesitate to read my solutions from the get go, copy-pasting them into your own script in the RStudio, executing them line by line, and trying to understand what’s going on. It’s the understanding that we’re after here. Whether or not you can implement everything yourself from scratch does not actually matter at all. (Nobody in history has learned to program from scratch. Everyone started by copy-pasting code examples written by someone else. So feel free to do the same!)
Exercise 1: Programming demographic models with slendr
Part 1: Building a demographic model in slendr
Use functions such as population(), gene_flow(), and compile_model(), which we discussed in the “crash course” at the start of this session, to program the following toy model of human demographic history in slendr. (Apologies for my bad handwriting and poor artistic ability.)
Note: You could easily program the model so that different ancestral populations are represented by separate population() commands (i.e., your model would start with a population called “human_chimp_ancestor” from which “CHIMP” and “hominin_ancestor” populations would split at 6 Mya, etc.), but generally this is too annoying to do and requires too much code.
Feel free to write the model so that “CHIMP” is the first population, then “AFR” population splits from it at 6 Mya, etc… Although it probably isn’t the most accurate way to describe the real evolutionary history, it simplifies the coding significantly.
[Mya = million years ago; kya = thousand years ago]
Hint: Create a new script exercise1.R in your RStudio session using the following “template”. Then add a sequence of appropriate population() calls using the syntax from the introductory slides (using the parent = <pop> argument for programming splits of daughter populations – which will be all except the CHIMP lineage in our example), etc.
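The template looks like this (the placeholder comments mark the bits you need to fill in with your own slendr code):

```r
library(slendr)
init_env()

# <replace this with `population()` definitions like in the slides>

# <replace this with your gene-flow definition in variable `gf`>

model <- compile_model(
  populations = list(...), # <put your list of populations here>
  gene_flow = gf,
  generation_time = 30
)
```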
Note: With slendr you can specify time in whatever format is more convenient or readable for your model. For instance here, because we’re dealing with historical events which are commonly expressed in times given as “years ago”, we can write them in decreasing order – i.e. 7 Mya → 6 Mya → …, as shown above – or, in terms of R code, 7e6 (or 7000000), 6e6 (6000000), etc.
In a later example, you will see that you can also encode the events in the time direction going “forward” (i.e., the first event starting in generation 1, a following event in generation 42, and so on).
Hint: Remember that slendr is designed with interactivity in mind! When you write a chunk of code (such as a command to create a population through a population split, or a model compilation to create a model object), execute that bit of code in the R console and inspect the summary information printed by evaluating the respective R object you just created. You can either copy-paste code from your script to the R console, or use a convenient RStudio shortcut like Ctrl+Enter (Linux and Windows) or Cmd+Enter (Mac).
Part 2: Inspecting the model visually
To visualize a slendr model, you use the function plot_model(). Plot your compiled model to make sure you programmed it correctly! Your figure should roughly correspond to my doodle above.
Note: Plotting of models in slendr can be sometimes a little wonky, especially if many things are happening at once. When plotting your model, experiment with arguments log = TRUE, proportions = TRUE, gene_flow = TRUE. Check ?plot_model for more information on these.
Part 3: Simulating genomic data
Once you have a compiled slendr model stored in an R variable (from now on, model will always mean a variable containing a compiled slendr model object relevant for the given exercise, for simplicity), we can simulate data from it. By default, slendr models always produce a tree sequence.
Note: A tree sequence provides an extremely efficient means to store and work with genomic data at massive scale. However, you can always get the simulated data in traditional file formats too, such as VCF, EIGENSTRAT, or a plain old table of ancestral/derived genotypes.
In this activity we will only be working with tree sequences, because it’s much easier and faster to get interesting statistics from them directly in R.
There are two simulation engines built into slendr implemented by functions msprime() and slim(). For traditional, non-spatial, neutral demographic models, the engine provided by the msprime() function is much more efficient, so we’ll be using that for the time being. However, from a popgen theoretical perspective, both simulation functions will give you the same results for any given compiled slendr model (up to some level of stochastic noise, of course).
Note: Yes, this means you don’t have to write any msprime (or SLiM) code to simulate data from a slendr model!
Here’s how you can use the function to simulate a tree sequence from the model you’ve just created using compile_model() in your script:
ts <- msprime(
model,
sequence_length = <length of sequence to simulate [as bp]>,
recombination_rate = <uniform recombination rate [per bp per generation]>
)

You will be seeing this kind of pattern over and over again in this exercise, so it’s a good idea to keep it in mind.
Hint: The msprime() function also has the arguments debug and run, which can be extremely useful for debugging.
Simulate a tree sequence from your compiled model using the msprime() engine, storing it in a variable ts as shown right above. Use sequence_length = 1e6 (so 1 Mb of sequence) and recombination_rate = 1e-8 (crossover events per base pair per generation). Then experiment with setting debug = TRUE (this prints out msprime’s own debugging summary, which you might already be familiar with from your previous activity) and then run = FALSE (this prints out the raw command line which can run the slendr simulation in the shell).
Part 4: Inspecting the tree-sequence object
As we will see later, slendr provides an R-friendly interface to accessing a subset of tskit’s functionality for working with tree sequences and for computing various popgen statistics.
For now, type out the ts object in the R console – what do you see? You should get a summary of a tree-sequence object that you’re familiar with from your msprime and tskit activity earlier in the day.
Note: This is a very important feature of slendr – when a simulation is concluded (no matter whether it was a slim() or msprime() simulation), you will get a normal tskit object. In fact, even though slendr supports (so far, and likely always) only a “subset” of all of tskit’s functionality, nothing stops you from writing custom Python/tskit code to process a tree sequence generated from a slendr model. Under the hood, a slendr simulation really is just an msprime (or SLiM) simulation! It’s just executed through a simplified interface.
The brilliance of the tree-sequence data structure rests on its elegant table-based implementation (much more information on that is here). slendr isn’t really designed to run very complex low-level manipulations of tree-sequence data (its strength lies in the convenient interface to popgen statistical functions implemented by tskit), but it does contain a couple of functions which can be useful for inspecting the lower-level nature of a tree sequence. Let’s look at a couple of them now.
Use the ts_table() function to inspect the low-level table-based representation of a tree sequence. For instance, you can get the table of nodes with ts_table(ts, "nodes"), edges with ts_table(ts, "edges"), and do the same thing for “individuals”, “mutations”, and “sites”. Does your tree sequence contain any mutations? If not, why, and how can we even do any popgen with data without any mutations? As you’re doing this, take a look at the following figure (this was made from a different tree sequence than yours, but that’s OK) to help you relate the information in the tables to the tree sequence which those tables (particularly the tables of nodes and edges) implicitly encode.
This should convince you that the final product of a slendr simulation really is the same kind of tree-sequence object that you learned about in the previous activities today. You don’t have to study these tables in detail!
There are also two slendr-specific functions called ts_samples() (which retrieves the “symbolic names” and dates of all recorded individuals at the end of a simulation) and ts_nodes(). You can run them simply as ts_samples(ts) and ts_nodes(ts). How many individuals (samples) are in your tree sequence as you simulated it? How is the result of ts_nodes() different from ts_samples()?
Part 5: Scheduling sampling events
In the table produced by the ts_samples() function you saw that the tree sequence we simulated recorded everyone. It’s very unlikely, unless we’re extremely lucky, that we’ll ever have a sequence of every single individual in a population that we study. To get a little closer to the scale of the genomic data that we usually work with on a day-to-day basis, we can restrict our simulation to only record a subset of individuals.
We can precisely define which individuals (from which populations, and at which times) should be recorded in a tree sequence using the slendr function schedule_sampling(). For instance, if we have a model with some slendr populations in variables eur and afr, we can schedule the recording of 5 individuals from each at times 10000 (years ago) and 0 (present-day) (using the “years before present” direction of time in our current model of Neanderthal introgression) with the following code:
pop_schedule <- schedule_sampling(model, times = c(10000, 0), list(eur, 5), list(afr, 5))

This function simply returns a data frame. As such, we can create multiple such schedules (of arbitrary complexity and granularity) and then bind them together into a single sampling schedule with a single line of code, like this:
# Note that the `times =` argument of the `schedule_sampling()` function can be
# a vector of times like here...
ancient_times <- c(40000, 30000, 20000, 10000)
eur_samples <- schedule_sampling(model, times = ancient_times, list(eur, 1))
# ... but also just a single number like here
afr_samples <- schedule_sampling(model, times = 0, list(afr, 1))
nea_samples <- schedule_sampling(model, times = 60000, list(nea, 1))
# However you create the individual sampling schedules,
# you can always bind them all to a single table with the `rbind()` function
schedule <- rbind(eur_samples, afr_samples, nea_samples)
schedule

Using the function schedule_sampling() (and with the help of rbind() as shown in the previous code chunk), program the sampling of the following sample sets at the given times, saving it to a variable called schedule:
| time | population | # individuals |
|---|---|---|
| 70000 | nea | 1 |
| 40000 | nea | 1 |
| 0 | chimp | 1 |
| 0 | afr | 5 |
| 0 | eur | 10 |
Additionally, schedule the sampling of a single eur individual at the following times:
t <- seq(40000, 2000, by = -2000)

Note: You can provide a vector variable (such as t in this example) as the times = argument of schedule_sampling().
In total, you should schedule the recording of 38 individuals.
Then, verify the correctness of your overall sampling schedule by visualizing it together with your model like this:
Note: As you’ve seen above, the visualization is often a bit wonky and convoluted with overlapping elements and it can be even worse with samples added, but try to experiment with arguments to plot_model described above to make the plot a bit more helpful for sanity checking.
plot_model(model, samples = schedule)

Part 6: Simulating a defined set of individuals
You have now both a compiled slendr model and a well-defined sampling schedule.
Use your combined sampling schedule stored in the schedule variable to run a new tree-sequence simulation from your model (again using the msprime() function), this time restricted to just those individuals scheduled for recording. You can do this by providing the combined sampling schedule as the samples = schedule argument of the function msprime you used above. Just replace the line(s) with your first msprime() from the previous part of this exercise with the new one, which uses the schedule for customized sampling.
Also, while you’re doing this, use the ts_mutate() function to overlay neutral mutations on the simulated tree sequence right after the msprime() call. (Take a look at the handouts for a reminder of the %>% pipe patterns I showed you.)
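Put together, the call might look something like this sketch (the mutation rate of 1e-8 here is an assumption for illustration – use whichever value fits your model):

```r
ts <-
  msprime(
    model,
    samples = schedule,           # record only the scheduled individuals
    sequence_length = 1e6,
    recombination_rate = 1e-8
  ) %>%
  ts_mutate(mutation_rate = 1e-8) # overlay neutral mutations (assumed rate)
```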
Inspect the tree-sequence object saved in the ts variable by typing it into the R console again (this interactivity really helps with catching nasty bugs early during the programming of your script). You can also do a similar thing via the table produced by the ts_samples() function. You should see a much smaller number of individuals being recorded, indicating that the simulation was much more efficient and produced genomic data for only the individuals of interest.
Note: When you think about it, it’s actually quite astonishing how fast msprime and tskit are when dealing with such a huge amount of sequence data from tens of thousands of individuals on a simple laptop!
Exercise 2: Computing popgen statistics on tree sequences from slendr
In this exercise, you will build on top of the results from Exercise 1. Specifically, we will learn how to compute popgen statistics on slendr-simulated tree sequences using slendr’s interface to the tskit Python module.
First, create a new R script exercise2.R and paste in the following code. This is one of the possible solutions to the Exercise 1, and it’s easier if we all use it to be on the same page from now on, starting from the same model and the same simulated tree sequence:
library(slendr)
init_env()
## The interface to all required Python modules has been activated.
chimp <- population("CHIMP", time = 7e6, N = 5000)
afr <- population("AFR", parent = chimp, time = 6e6, N = 15000)
eur <- population("EUR", parent = afr, time = 70e3, N = 3000)
nea <- population("NEA", parent = afr, time = 600e3, N = 1000, remove = 40e3)
gf <- gene_flow(from = nea, to = eur, rate = 0.03, start = 55000, end = 50000)
model <- compile_model(
populations = list(chimp, nea, afr, eur),
gene_flow = gf,
generation_time = 30
)
# We will read a cached version of a tree sequence I simulated myself
# to make sure we're all on the same page. That said, if you managed to
# do Exercise 1 on your own, feel free to stick with your own tree sequence!
ts <- ts_read(here::here("data/introgression.trees"), model = model)
cowplot::plot_grid(
plot_model(model, proportions = TRUE),
plot_model(model, proportions = TRUE, log = TRUE),
nrow = 1
)

As a sanity check, let’s use a couple of tidyverse table-munging tricks to make sure the tree sequence contains a set of samples that matches our intended sampling schedule (particularly the time series of European individuals and the two Neanderthals):
library(dplyr)
# total number of recorded individuals in the tree sequence
ts_samples(ts) %>% nrow
[1] 38
# times of sampling of each recorded European individual
ts_samples(ts) %>% filter(pop == "EUR") %>% pull(time)
 [1] 40000 38000 36000 34000 32000 30000 28000 26000 24000 22000 20000 18000
[13] 16000 14000 12000 10000 8000 6000 4000 2000 0 0 0 0
[25] 0 0 0 0 0 0
# times of sampling of each recorded Neanderthal
ts_samples(ts) %>% filter(pop == "NEA") %>% pull(time)
[1] 70000 40000
# count of individuals in each population
ts_samples(ts) %>%
group_by(pop, present_day = time == 0) %>%
tally %>%
select(pop, present_day, n) %>%
arrange(present_day)
# A tibble: 5 × 3
# Groups: pop [4]
pop present_day n
<chr> <lgl> <int>
1 EUR FALSE 20
2 NEA FALSE 2
3 AFR TRUE 5
4 CHIMP TRUE 1
5 EUR TRUE 10
Note: These bits of tidyverse code are extremely helpful when you’re working with large tree sequences with many individuals as sanity checks that your sampling worked as intended. I’m listing them here in case you’ve never worked with the tidyverse family of R packages before (such as the dplyr package where filter(), group_by(), tally(), and pull() come from).
Everything looks good! Having made sure that the ts object contains the individuals we want, let’s move to the exercise.
Part 1: Computing nucleotide diversity
The toy model of ancient human history plotted above makes a fairly clear prediction of the nucleotide diversity expected in the simulated populations. Compute the nucleotide diversity in all populations using the slendr function ts_diversity() on your tree sequence ts. Do you get numbers that (relative to one another across populations) match what you would expect from the model, given the \(N_e\) you programmed for each?
Hint: Nearly every slendr statistic function interfacing with tskit accepts a ts tree-sequence object as its first argument, with further arguments being either a vector of individual names representing a group of samples to compute a statistic on, or a (named) list of such vectors (each element of that list for a group of samples) – these lists are intended to be equivalent to the sample_sets = argument of many tskit Python methods (which you’ve learned about in your activity on tskit), except that they allow symbolic names of individuals, rather than integer indices of nodes in a tree sequence.
Although you can get all the above information by processing the table produced by the ts_samples() function, slendr provides a useful helper function ts_names() which only returns the names of individuals as a vector (or a named list of such vectors, one vector per population as shown below).
When you call it directly, you get a plain vector of individual names:
ts_names(ts)
 [1] "NEA_1" "EUR_1" "NEA_2" "EUR_2" "EUR_3" "EUR_4" "EUR_5"
[8] "EUR_6" "EUR_7" "EUR_8" "EUR_9" "EUR_10" "EUR_11" "EUR_12"
[15] "EUR_13" "EUR_14" "EUR_15" "EUR_16" "EUR_17" "EUR_18" "EUR_19"
[22] "EUR_20" "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5" "CHIMP_1"
[29] "EUR_21" "EUR_22" "EUR_23" "EUR_24" "EUR_25" "EUR_26" "EUR_27"
[36] "EUR_28" "EUR_29" "EUR_30"
This is not super helpful, unless we want to compute some statistic for everyone in the tree sequence, regardless of their population assignment. Perhaps a bit more useful is to call the function like this, because it will produce a result which can be immediately used as the sample_sets = argument mentioned in the Hint above:
ts_names(ts, split = "pop")
$AFR
[1] "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5"
$CHIMP
[1] "CHIMP_1"
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
As you can see, this gave us a normal R list, with each element containing a vector of individual names in a population. Note that we can use standard R list indexing to get subsets of individuals:
names <- ts_names(ts, split = "pop")
names["NEA"]
$NEA
[1] "NEA_1" "NEA_2"
names[c("EUR", "NEA")]
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
etc.
Many of the following exercises will use these kinds of tricks to instruct various slendr / tskit functions to compute statistics on subsets of all individuals sub-sampled in this way.
After you have computed the nucleotide diversity per population, compute it for each individual separately using the same function ts_diversity() (which, in this setting, effectively gives you the heterozygosity of each individual). If you are familiar with plotting in R, visualize the individual-based heterozygosities across all populations.
Hint: You can do this by giving a vector of names as sample_sets = (so not an R list of vectors). You could also use the data frame produced by ts_samples(ts) to get the names, just adding the heterozygosities to that data frame as a new column.
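For instance, both computations might be sketched like this (a hedged sketch – check ?ts_diversity for the exact interface and defaults):

```r
# per-population diversity: a named list of name vectors as sample sets
ts_diversity(ts, sample_sets = ts_names(ts, split = "pop"))

# per-individual "heterozygosity": a plain vector of individual names
inds <- ts_samples(ts)
ts_diversity(ts, sample_sets = inds$name)
```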
Part 2: Computing pairwise divergence
Use the function ts_divergence() to compute genetic divergence between all pairs of populations. Again, do you get results compatible with our demographic model in terms of expectation given the split times between populations as you programmed them for your model?
Hint: Again, you can use the same concept of sample_sets = we discussed in the previous part. In this case, the function computes pairwise divergence between each element of the list given as sample_sets = (i.e., for each vector of individual names).
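A minimal sketch of this, reusing the named list of per-population sample names from the previous part (see ?ts_divergence for details):

```r
# pairwise divergence between all pairs of populations in the sample sets
sample_sets <- ts_names(ts, split = "pop")
ts_divergence(ts, sample_sets = sample_sets)
```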
Part 3: Detecting Neanderthal admixture in Europeans
Let’s now pretend it’s about 2008, we’ve sequenced the first Neanderthal genome, and we are working on a project that will change human evolution research forever. We also have the genomes of a couple of people from Africa and Europe, which we want to use to answer the most burning question of all evolutionary anthropology: “Do some people living today carry Neanderthal ancestry?”
Earlier you learned about \(f\)-statistics of various kinds. You have also heard that an \(f_4\) statistic (or its equivalent \(D\) statistic) can be used as a test of “treeness”. Simply speaking, for some “quartet” of individuals or population samples, it can be used as a hypothesis test of whether the history of those samples is compatible with no introgression having occurred.
Compute the \(f_4\) test of Neanderthal introgression in EUR individuals using the slendr function ts_f4(). When you’re running it, you will have to provide individuals to compute the statistic using a slightly different format. Take a look at the help page available as ?ts_f4 for more information. When you’re computing the \(f_4\), make sure to set mode = "branch" argument of the ts_f4() function (we will get to why a bit later).
Note: By default, each slendr / tskit statistic function operates on mutations, and this will switch them to use branch length (as you might know, \(f\)-statistics are mathematically defined using branch lengths in trees and mode = "branch" does exactly that).
Hint: If you haven’t learned this in your \(f\)-statistics lecture, you want to compute (and compare) the values of these two statistics using the slendr function ts_f4():
\(f_4\)(<some African>, <another African>; <Neanderthal>, <Chimp>)
\(f_4\)(<some African>, <a test European>; <Neanderthal>, <Chimp>),
where <individual> can be the name of any individual recorded in your tree sequence, such as the names you saw in the name column of the table returned by ts_samples(ts) (i.e., "NEA_1" could be used as a “representative” <Neanderthal> in those equations, and similarly "CHIMP_1" as the fourth sample in the \(f_4\) quartet, representing the outgroup).
To simplify things a lot, we can understand the above equations as comparing the counts of so-called BABA and ABBA allele patterns between the quartet of samples specified in the statistics:
\[ f_4(AFR, X; NEA, CHIMP) = \frac{\#BABA - \#ABBA}{\#SNPs} \]
The first \(f_4\) statistic above is not expected to give values “too different” from 0 (even in the case of Neanderthal introgression into Europeans), because we don’t expect two African individuals to differ “significantly” in terms of how many alleles they share with a Neanderthal (their ancestors never met Neanderthals!). The other should – if there was Neanderthal introgression into Europeans at some point in their history – be “significantly negative”.
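In slendr code, the two statistics might look like this sketch (the specific individual names here are illustrative – use any recorded individuals from ts_samples(ts)):

```r
# baseline: two Africans against a Neanderthal, with a Chimp outgroup
ts_f4(ts, W = "AFR_1", X = "AFR_2", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")

# test: an African vs a European against a Neanderthal, with a Chimp outgroup
ts_f4(ts, W = "AFR_1", X = "EUR_1", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
```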
Is the second of those two statistics “much more negative” than the first, as expected assuming introgression from Neanderthals into Europeans?
Why am I putting “significantly” and “much more negative” in quotes in the previous sentence? What are we missing here for this being a true hypothesis test as you might be accustomed to from computing \(f\)-statistics using a tool such as ADMIXTOOLS? (We will get to this again in the following part of this exercise.)
Part 4: Detecting Neanderthal admixture in Europeans v2.0
The fact that we don’t get something equivalent to a p-value in these kinds of simulations is generally not a problem, because we’re often interested in establishing a trend of a statistic under various conditions, and understanding when and how its expected value behaves in a certain way. If statistical noise is a problem, we work around this by computing a statistic on multiple simulation replicates or even increasing the sample sizes.
Note: To see this in practice, you can check out a paper in which I used this approach quite successfully on a related problem.
On top of that, the p-value of something like an \(f\)-statistic (whether it’s significantly different from zero) is also strongly affected by the quality of the data, sequencing errors, coverage, etc. (all of which can certainly be examined using simulations!). However, these are aspects of modeling which are quite orthogonal to the problem of investigating the expectations and trends of statistics given some underlying evolutionary model, which is what we’re after in these exercises.
That said, even with perfect simulated data, what exactly “significantly different from zero compared to some baseline expectation” means can be blurred by noise in the simple single-individual comparisons we did above. Let’s increase the sample size a bit to see if the statistical pattern expected in the \(f_4\) statistic from our Neanderthal introgression model becomes more apparent.
Compute the first \(f_4\) statistic (the baseline expectation between a pair of Africans) and the second \(f_4\) statistic (comparing an African and a European), but this time on all recorded Africans and all recorded Europeans, respectively. Plot the distributions of those two sets of statistics. This should remove lots of the uncertainty and make a statistical trend stand out much more clearly.
Hint: Whenever you need to compute something for many things in sequence, looping is very useful. One way to compute, say, an \(f_4\) statistic over many individuals is with this kind of pattern, using R’s looping function lapply():
# Loop over vector of individual names (variable x) and apply a given ts_f4()
# expression on each individual (note the ts_f4(..., X = x, ...) in the code)
list_f4 <- lapply(
c("ind_1", "ind_2", ...),
function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)
# The above gives us a list of data frames, so we need to bind them all into a
# single table for easier interpretation and visualization
df_f4 <- do.call(rbind, list_f4)

Exercise 3: Simulation-based inference of \(N_e\)
So far we’ve learned how slendr provides an easy way to define demographic models in R and simulate (even very large!) tree sequences from them. This allows us to quickly verify our intuition about some popgen problem (things like “Hmmm, I wonder what an \(f_4\) statistic would look like if my model included this particular gene-flow event?”), in just a few lines of R. There have been instances in which we’ve been able to even answer questions like this directly in a meeting, pretty much on the spot! This makes slendr a very powerful “popgen calculator”.
Now let’s take things one step further. Imagine you gathered some empirical data, like an allele frequency spectrum (AFS) from a population that you study. That data was, in the real world, produced by some (hidden) biological process (demographic history) that we want to learn about. For instance, the population we study had some \(N_e\), which we don’t know the value of (the only thing we have is the observed AFS) but we want to infer that value.
Simulations can be a great tool to estimate the most likely value of such an unknown parameter. Briefly speaking, in this particular toy example, we can simulate a large number of AFS vectors (each resulting from a different assumed \(N_e\) value) and then pick just those \(N_e\) values (or just one \(N_e\) value) which produced a simulated AFS closest to the observed AFS.
This is exactly what you’ll be doing just now in Exercise 3.
Part 1: A self-contained slendr function of \(N_e \rightarrow \textrm{AFS}\)
In a new script exercise3.R write a custom R function called simulate_afs(), which will take Ne as its only parameter. Use this function to compute (and return) AFS vectors for a couple of Ne values of your choosing, staying between Ne = 1000 and Ne = 30000. Plot those AFS vectors and observe how (and why?) they differ depending on the Ne parameter you used in each respective simulation.
Hint: The function should create a one-population forward-time model (the population starting at time = 1, with simulation_length = 100000 and generation_time = 1 in compile_model()), simulate a 10Mb tree sequence using msprime() (recombination rate 1e-8), overlay neutral mutations on it at mutation_rate = 1e-8, compute the AFS for 10 samples, and return that AFS vector as the result of the custom function.
Hint: If you’ve never programmed before, the concept of a “custom function” might be very alien to you. Again, if you need help, feel free to start building your exercise3.R solution based on this “template” (just fill in missing relevant bits of slendr code that you should be already familiar with):
library(slendr)
init_env()
simulate_afs <- function(Ne) {
# In here you should write code which will:
# 1. create one population with a given Ne (provided as a function argument)
# 2. compile a model using `simulation_length =` and `generation_time =`
# 3. simulate a tree sequence
# 4. select names of 10 samples (doesn't matter which, "pop_1", "pop_2", ...)
# 5. compute AFS vector from those 10 individuals using `ts_afs()`
# `result` is a variable with your 10-sample AFS vector (we remove the
# first element because it's not meaningful for our example)
return(result[-1])
}
afs_1 <- simulate_afs(Ne = 1000) # simulate AFS from a Ne = 1000 model...
plot(afs_1, type = "o") # ... and plot it
Note: Remember that you should drop the first element of the AFS vector produced by ts_afs() (for instance with something like result[-1] if result contains the output of ts_afs()) for technical reasons related to tskit. You don’t have to worry about that here, but you can read this for more detail.
Hint: If the above still doesn’t make any sense to you, feel free to copy-paste the function from the solution below into your script and work with that function instead!
When used in R, your custom function should work like this (the simulation is stochastic, so your numbers will be different, of course):
# This gives us a vector of singletons, doubletons, etc., etc., all the way
# to the number of fixed mutations in our sample of 10 individuals
simulate_afs(Ne = 1000)
 [1] 429 238 103 103 97 68 75 50 55 37 46 33 38 48 32 27 20 26 28
[20] 8
Part 2: Estimating unknown \(N_e\) from empirical AFS
Imagine you sequenced 10 samples from a population and computed the following AFS vector (which contains, sequentially, the number of singletons, doubletons, etc., in your sample from a population):
afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
264, 218, 133, 173, 159, 142, 167, 129, 125, 143)
You know (maybe from some fossil evidence) that the population probably had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by your simulate_afs() function – how convenient!).
Use slendr simulations to guess the true (and hidden!) \(N_e\) given the observed AFS by running simulations for a range of \(N_e\) values and finding out which \(N_e\) produces the closest AFS vector to the afs_observed vector above using one of the following two approaches.
Option 1 [easy]: Plot AFS vectors for various \(N_e\) values (i.e. simulate several of them using your function simulate_afs()), then eyeball which looks closest to the observed AFS based on the figures alone. (This is, of course, not how proper statistical inference is done, but it will be good enough for this exercise!)
Option 2 [hard]: Simulate AFS vectors across a grid of possible Ne values (maybe using lapply()?), and find the \(N_e\) which gives the closest AFS to the observed AFS based on the mean squared error.
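To make Option 2 concrete, here is a minimal sketch of the grid-search idea. The mse() helper and the spacing of Ne_grid are my own arbitrary choices (not slendr functions), and the commented-out lines assume the simulate_afs() function from Part 1 and the afs_observed vector defined above:

```r
# Mean squared error between two equal-length vectors -- a simple measure of
# how far a simulated AFS lies from the observed one (plain base R helper)
mse <- function(a, b) mean((a - b)^2)

# A grid of candidate Ne values to test (the step of 1000 is arbitrary)
Ne_grid <- seq(1000, 30000, by = 1000)

# For each candidate Ne, simulate an AFS and score it against the observed
# spectrum, then pick the best-fitting value (uncomment once simulate_afs()
# and afs_observed exist in your session):
#   errors <- sapply(Ne_grid, function(Ne) mse(simulate_afs(Ne), afs_observed))
#   Ne_grid[which.min(errors)]

# The distance measure itself, demonstrated on toy vectors:
mse(c(1, 2, 3), c(1, 2, 5)) # mean of (0, 0, 4) = 4/3
```

Note that which.min() returns the index of the smallest error, so Ne_grid[which.min(errors)] is the candidate whose simulated AFS sits closest to the data.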
Congratulations, you now know how to infer parameters of evolutionary models using simulations! What you just did is really very similar to how simulation-based inference is done in practice (even with methods such as ABC). Hopefully you can also see how easy slendr makes it.
This kind of approach can be used to infer all sorts of demographic parameters, even using other summary statistics that you’ve also learned to compute… including selection parameters, which we delve into in the next exercise.
Exercise 4: Simulating (and detecting) natural selection
The primary motivation for designing slendr was to make demographic modelling in R as trivially easy and fast as possible, focusing exclusively on neutral models. However, as slendr became popular, people have been asking for the possibility of simulating natural selection. After all, a large part of slendr’s functionality deals with population genetic models across geographical landscapes, which requires SLiM. So why not support selection simulations using slendr as well?
In December 2024 I caved in and added support for modifying slendr demographic models with bits of SLiM code, which allows simulating pretty much any arbitrary selection scenario you might be interested in.
This exercise is a quick demonstration of how this works and how you might simulate selection using slendr. We will do this using another toy model of ancient human history, which we will first use as a basis for simulating the frequency trajectory of an allele under positive selection, and then for implementing a toy selection scan using Tajima’s D.
To speed things up, create a new exercise4.R script and copy the following code as a starting point for this exercise:
library(slendr)
init_env(quiet = TRUE)
# This line sources a script in which I provide a few useful helper functions
# which you can use in this exercise
source(here::here("utils.R"))
# African ancestral population
afr <- population("AFR", time = 65000, N = 5000)
# First migrants out of Africa
ooa <- population("OOA", parent = afr, time = 60000, N = 5000, remove = 27000)
# Eastern hunter-gatherers
ehg <- population("EHG", parent = ooa, time = 28000, N = 5000, remove = 6000)
# European population
eur <- population("EUR", parent = ehg, time = 25000, N = 5000)
# Anatolian farmers
ana <- population("ANA", time = 28000, N = 5000, parent = ooa, remove = 4000)
# Yamnaya steppe population
yam <- population("YAM", time = 8000, N = 5000, parent = ehg, remove = 2500)
# Define gene-flow events
gf <- list(
gene_flow(from = ana, to = yam, rate = 0.75, start = 7500, end = 6000),
gene_flow(from = ana, to = eur, rate = 0.5, start = 6000, end = 5000),
gene_flow(from = yam, to = eur, rate = 0.6, start = 4000, end = 3500)
)
# Compile all populations into a single slendr model object
model <- compile_model(
populations = list(afr, ooa, ehg, eur, ana, yam),
gene_flow = gf, generation_time = 30
)
# Schedule the sampling from four European populations roughly before their
# disappearance (or before the end of the simulation)
schedule <- rbind(
schedule_sampling(model, times = 0, list(eur, 50)),
schedule_sampling(model, times = 6000, list(ehg, 50)),
schedule_sampling(model, times = 4000, list(ana, 50)),
schedule_sampling(model, times = 2500, list(yam, 50))
)Next, visualize the demographic model. If you did a bit of work in human population genetics, you might recognize it as a very simplified model of demographic history of Europe over the past 50 thousand years or so. As you can see, we are recording 50 individuals from four populations – for Europeans we sample 50 individuals at “present-day”, for the remaining populations we’re recording 50 individuals just before their disappearance. Also note that there’s quite a bit of gene-flow! This was an important thing we’ve learned about human history in the past 10 years or so – everyone is mixed with pretty much everyone, there isn’t (and never was) anything as a “pure population”.
Note: We didn’t discuss it earlier, but slendr also provides the option to specify a remove = argument in a population() call, which instructs the simulation engine to delete a population from the simulation at a given point in time. For our msprime() simulations in earlier examples this wasn’t really important, but for the slim() simulation we will be running below, we want to make a population extinct at a certain timepoint – which is why the ancient populations in the starting script have the remove = parameter specified.
plot_model(model, proportions = TRUE, samples = schedule)
Part 1: Simulating a tree sequence and computing Tajima’s D
Although the point of this exercise is to simulate selection, let’s first simulate a normal neutral model using slendr’s msprime() engine as a sanity check. Simulate 10 Mb of sequence with a recombination rate of 1e-8 and the sampling schedule defined above. Let’s not worry about adding any mutations, just to change things up a little bit. We’ll be working with branch-based statistics here (which means adding mode = "branch" whenever we compute a statistic, such as Tajima’s D).
Inspect the table of all individuals recorded in our tree sequence using the function ts_samples(), making sure we have all the individuals scheduled for tree-sequence recording. (Again, there’s no such thing as too many sanity checks when doing research!)
As you’ve already learned in an earlier exercise, tskit functions in slendr generally operate on vectors (or lists) of individual names, like those produced by ts_names(). Get a vector of names of individuals in every population recorded in the tree sequence, then use this to compute Tajima’s D using the slendr function ts_tajima(). (Use the same approach as you did with ts_diversity() or ts_divergence() above, passing the list of names of individuals as the sample_sets = argument of ts_tajima().) Do you see any striking differences in the Tajima’s D values across populations? Check this for some general guidance.
Part 2: Computing Tajima’s D in windows
Let’s take this one step further. Even if there is a locus under positive selection somewhere along our chromosome, it might be quite unlikely that we would find a significant Tajima’s D value for the entire chromosome (which is basically what we did in Part 1). Fortunately, thanks to the flexibility of the tskit module, the slendr function ts_tajima() has an argument windows =, which allows us to specify the coordinates of windows into which the sequence should be broken, with Tajima’s D computed separately for each window. Perhaps this will allow us to see the impact of positive selection once we get to adding selection to our model. So let’s first build some code towards that.
Define a variable windows which will contain a vector of coordinates of 100 windows, starting at position 0, and ending at position 10e6 (i.e., the end of our chromosome). Then provide this variable as the windows = argument of ts_tajima() on a new, separate line of your script. Save the result of ts_tajima() into the variable tajima_wins, and inspect its contents in the R console.
Hint: You can use the R function seq() and its argument length.out = 100, to create the coordinates of window boundaries very easily.
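As a minimal sketch of the hint above (plain base R; the ts_tajima() call in the comment assumes the ts object and the per-population sample names from Part 1):

```r
# 100 evenly spaced window boundaries between positions 0 and 10 Mb; note
# that in tskit a vector of k breakpoints defines k - 1 windows, so tweak
# length.out if you need an exact window count
windows <- as.integer(seq(from = 0, to = 10e6, length.out = 100))

windows[1:5] # first few boundaries, starting at position 0

# then, on a separate line of your script, something like:
#   tajima_wins <- ts_tajima(ts, sample_sets = <your sample names>,
#                            windows = windows, mode = "branch")
```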
The default output format of ts_tajima() is not super user-friendly. Process the result using a helper function process_tajima(tajima_wins) that I provided for you (perhaps save it as tajima_df), and visualize it using another of my helper functions plot_tajima(tajima_df).
Note: Making the process_tajima() and plot_tajima() function available in your R code is the purpose of the source(here::here("utils.R")) command at the beginning of your script for this exercise.
Part 3: Adding positive selection to the base demographic model
Although primarily designed for neutral demographic models, slendr allows optional simulation of natural selection: you can provide a “SLiM extension snippet” with custom SLiM code via the optional extension = argument of compile_model() (a function you’re closely familiar with at this point).
Unfortunately we don’t have any space to explain SLiM here (and I have no idea, at the time of writing, whether or not you will have worked with SLiM earlier in this workshop). Suffice it to say that SLiM is another very popular population genetic simulator which allows the simulation of selection, and which requires you to write custom code in a different programming language called Eidos.
Take a look at the file slim_extension.txt provided in your working directory (it’s also part of the GitHub repository here). If you have worked with SLiM before, glance through the script casually and see if it makes any sense to you. If you have not worked with SLiM before, look for the strange {elements} in curly brackets in the first ten lines of the script. Those are the parameters of the selection model with which we will customize the standard neutral demographic model in the next step.
Specifically, when you inspect the slim_extension.txt file, you can see that this “SLiM extension script” I provided for you has three parameters:
origin_pop– in which population should a beneficial allele appear,s– what should be the selection coefficient of the beneficial allele, andonset_time– at which time should the allele appear in theorigin_pop.
However, at the start, the SLiM extension snippet doesn’t contain any concrete values of those parameters, but only their {origin_pop}, {s}, and {onset_time} placeholders.
Use the slendr function substitute_values() to substitute concrete values for those parameters like this:
extension <- substitute_values(
template = here::here("slim_extension.txt"),
origin_pop = "EUR",
s = 0.15,
onset_time = 12000
)
extension
[1] "/var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpjCUDOq/filed90b570122e2"
You can see that substitute_values() returned a path to a file. Take a look at that file in your terminal – you should see each of the three {placeholder} parameters replaced with its concrete value.
And that’s all the extra work we need to turn our purely neutral demographic slendr model into a model which includes natural selection! (In this case, only simple selection acting on a single locus, as you’ll see later, but this can be generalized to any imaginable selection scenario.)
How do we use the SLiM extension for our simulation? It’s very simple – we just have to provide the extension variable as an additional argument of good old compile_model(). This will compile a new slendr model which will now include the new functionality for simulating natural selection:
Compile a new model of the history of populations afr, ooa, ehg, etc., by following the instructions above, providing a new extension = argument to the compile_model() function.
Part 4: Running a selection simulation using slim()
Now we can finally run our selection simulation!
There are two modifications to our previous simulation workflows:
- Because we need to run a non-neutral simulation, we have to switch from the msprime() slendr engine to slim(). The latter can interpret the same demographic model we programmed in R, just like the msprime() engine can, but will run the model using SLiM (thus leveraging the new SLiM extension code that we customized using substitute_values() above). We do this simply by switching from this:
ts <- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule)
to this:
ts <- slim(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule)
As you can see, you don’t have to modify anything in your model code, just switch from msprime() to slim() in the line of code which produces the simulation result.
- The customized model will not only produce a tree sequence, but will also generate a table of allele frequencies in each population (SLiM experts might have noticed the relevant SLiM code when they were inspecting slim_extension.txt). We need to be able to load both of these files after the simulation, and thus need the path to a location where we can find them. We can do this by calling the slim() function as path <- slim(..., path = TRUE) (so with the extra path = argument). This will return the path to where the slim() engine saved all files with our desired results.
Run a simulation from the modified selection model with the slim() engine as instructed in points 1 and 2 above, then use the list.files(path) function in R to take a look at the directory. Which files were produced by the simulation?
Part 5: Investigating allele frequency trajectories
Use another helper function read_trajectory(path) which I provided for this exercise to read the simulated frequency trajectories of the positively selected mutation in all of our populations into a variable traj_df. Then run a second helper function plot_trajectory(traj_df) to inspect the trajectories visually.
Recall that you used the function substitute_values() to parametrize your selection model so that the allele under selection appears in Europeans 12 thousand years ago, and is programmed to be under very strong selection of \(s = 0.15\). Do the trajectories visualized by plot_trajectory() make sense given the demographic model of European prehistory plotted above?
Part 6: Tajima’s D (genome-wide and window-based) from the selection model
Recall that your simulation run saved results in the location stored in the path variable:
list.files(path)
[1] "slim.trees" "trajectory.tsv"
From this path, we’ve already successfully investigated the frequency trajectories.
Now let’s compute Tajima’s D on the tree sequence simulated from our selection model. Hopefully we should see an interesting pattern in our selection scan – after all, we don’t know yet where in the genome the putative locus under selection lies!
To read a tree sequence simulated with slim() under our customized selection setup, we need to do a bit of work. To simplify things, here’s the R code which makes this possible. Just copy it into your exercise4.R script as it is:
# Let's use my own saved simulation results, so that we're all on the
# same page going forward
path <- here::here("data/selection")
ts <-
file.path(path, "slim.trees") %>% # 1. compose full path to the slim.trees file
ts_read(model) %>% # 2. read the tree sequence file into R
ts_recapitate(Ne = 5000, recombination_rate = 1e-8) # 3. perform recapitation
Very briefly, because our tree sequence was generated by SLiM, it’s very likely that not all genealogies along the simulated genome will be fully coalesced (i.e., not all trees will have a single root). Explaining why this is the case is out of the scope of this session, but read here if you’re interested in learning more. For the time being, it suffices to say that we can pass the (uncoalesced) tree sequence into the ts_recapitate() function, which takes a SLiM tree sequence and simulates all the necessary “ancestral history” that was missing from the uncoalesced trees, ensuring that the entire tree sequence is fully coalesced and statistics can be correctly computed on it.
Now that you have a ts tree sequence object resulting from a new selection simulation run, repeat the analyses of genome-wide and window-based Tajima’s D from Part 1 and Part 2 of this exercise, again using the provided helper functions process_tajima() and plot_tajima(). Can you identify which locus has been the likely focal point of the positive selection? Which population shows evidence of selection? Which doesn’t and why (look again at the visualization of the demographic model above)?
(Homework) Exercise 5: Your own model
Do you have a favourite model organism? Take some of the most important features of its demographic history and program them as a new slendr model in a new R script.
Then, think about some interesting population genetic statistics that you either know from the literature or that you perhaps computed on your own empirical data. Try to compute them on a tree sequence simulated from your model using msprime(). Do you get results which are in the ballpark of your expectations?
More information
The slendr paper is published in PCI EvolBiol (it has a lot of explanation, motivation, and plenty of examples focusing on simulations on geographic landscapes!)
The website of the slendr package is here
Detailed tutorials are under Articles on the website
slendr GitHub repo (submit bug reports!) is here
Check out my new WIP package demografr
Useful for simulation-based inference (ABC, grid-search, genetic algorithms, etc.)
It builds on slendr! You now know 90% of what you need to do inference with it.
This is the coolest thing I’ve ever built so definitely take a look!